[Not-for-merge] Try to implement kernels along axis with one kernel for IndexAxis0 #668
+421
−58
This is just to show the idea of implementing such kernels (e.g. in `Append`/`Stack`) with one kernel. There are still a lot of TODOs; I am opening this PR so that we can fix them later, as this task is not a P1 task currently.

Here are the benchmarks:
The header is `IndexAxis0_(New)_NumAxes_Dim0_NumElements_..._AverageTime`.
We can see that for large sizes the new approach is much slower than the old one. One main reason is that we do not process multiple elements in one thread as the old approach did (which saves the time spent indexing `new_offsets` and `old_offsets`). ModernGpu does not support this, so we would need to write the kernel ourselves. Another reason is that there are a few `if` statements in the kernel.
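To make the first point concrete, here is a rough host-side C++ sketch (not the actual kernel in this PR) of the difference between handling one element per thread and handling a run of elements per thread. The function names `GatherOnePerThread` and `GatherChunked` are hypothetical, and the layout is an assumption made for illustration: output row `r` occupies `[new_offsets[r], new_offsets[r + 1])` and is copied from `src` starting at `old_offsets[r]`. Each outer loop iteration stands in for one GPU thread.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// ASSUMED layout (illustrative only): new_offsets has num_rows + 1 entries,
// old_offsets has num_rows entries giving each row's start position in src.

// One element per "thread": every element pays for its own binary search
// in new_offsets and its own read of old_offsets.
std::vector<int32_t> GatherOnePerThread(const std::vector<int32_t> &src,
                                        const std::vector<int32_t> &old_offsets,
                                        const std::vector<int32_t> &new_offsets) {
  assert(!new_offsets.empty() && old_offsets.size() + 1 == new_offsets.size());
  int32_t n = new_offsets.back();
  std::vector<int32_t> dst(n);
  for (int32_t i = 0; i < n; ++i) {  // one iteration == one thread
    int32_t row = static_cast<int32_t>(
        std::upper_bound(new_offsets.begin(), new_offsets.end(), i) -
        new_offsets.begin()) - 1;
    dst[i] = src[old_offsets[row] + (i - new_offsets[row])];
  }
  return dst;
}

// Several elements per "thread": one binary search per chunk, then the row
// is advanced only when the chunk crosses a row boundary.
std::vector<int32_t> GatherChunked(const std::vector<int32_t> &src,
                                   const std::vector<int32_t> &old_offsets,
                                   const std::vector<int32_t> &new_offsets,
                                   int32_t chunk) {
  assert(chunk > 0 && !new_offsets.empty() &&
         old_offsets.size() + 1 == new_offsets.size());
  int32_t n = new_offsets.back();
  std::vector<int32_t> dst(n);
  for (int32_t begin = 0; begin < n; begin += chunk) {  // one iteration == one thread
    int32_t end = std::min(begin + chunk, n);
    int32_t row = static_cast<int32_t>(
        std::upper_bound(new_offsets.begin(), new_offsets.end(), begin) -
        new_offsets.begin()) - 1;
    int32_t row_end = new_offsets[row + 1];
    int32_t shift = old_offsets[row] - new_offsets[row];  // cached per row
    for (int32_t i = begin; i < end; ++i) {
      while (i >= row_end) {  // step over finished (and empty) rows
        ++row;
        row_end = new_offsets[row + 1];
        shift = old_offsets[row] - new_offsets[row];
      }
      dst[i] = src[i + shift];
    }
  }
  return dst;
}
```

Both functions produce the same output. The chunked variant reads `new_offsets` and `old_offsets` once per row boundary instead of once per element, which is the saving described above. A real CUDA kernel would also have to consider memory coalescing and load balance across threads, which this sketch ignores.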